Profiling with IPython

Sometimes our scripts take a lot of time or memory to run. That happens especially in machine learning tasks when we train our algorithms on large data sets. Profiling a program is a way of finding out where it becomes slow and where exactly its memory bottlenecks are. Luckily, IPython makes it very easy to profile a Python script.

Let's assume we have a few functions that perform different calculations. We define them here directly, but they could also come from a .py file and be imported into IPython.


In [ ]:
from time import sleep

def foo():
    print('foo: calculating heavy stuff...')
    sleep(1)

def bar():
    print('bar: calculating heavy stuff...')
    sleep(2)

def baz():
    foo()
    bar()

IPython magic functions

IPython provides a few so-called magic functions that start with % and are not part of the Python language. For profiling there are %time and %timeit.


In [ ]:
%time baz()

In [ ]:
%timeit baz()

As you can see, %time runs a function and measures its execution time. %timeit runs a function several times and outputs statistics about the execution times.
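%timeit also accepts options; for example, -n sets how many times the statement is executed per measurement and -r how many measurements are taken. Since baz() takes about three seconds, it makes sense to keep these numbers small:


In [ ]:
# execute baz() once per measurement and repeat the measurement 3 times
%timeit -n 1 -r 3 baz()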

Line-wise profiling

Very often, though, we know perfectly well that our function foo is slow, we just don't know which part exactly makes it that slow. That's where we can load the line_profiler extension and output statistics about every single line.

Usage: %lprun -f function_to_profile function_to_call()

Let's call baz() and output the details for foo() and baz():


In [ ]:
%load_ext line_profiler
%lprun -f foo -f baz baz()
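
If loading the extension fails, the line_profiler package is probably not installed; it can usually be installed with pip install line_profiler.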

Memory

Memory profiling works very similarly. Unfortunately, the necessary dependencies are not installed on the Windows machines here. We still show an example of how it could look.

Let's start with a function that uses a bit of memory:

def moo():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a

Again there is an IPython magic function called %memit, which becomes available after loading the memory_profiler extension:

%load_ext memory_profiler
%memit moo()

And memory_profiler also provides a line-wise magic, %mprun, similar to %lprun from line_profiler. However, it only works if the function to profile is defined in a file, which is why we can't run it here in the notebook. Running the following code ...

%load_ext memory_profiler

from my_module import moo

%mprun -f moo moo()

... would look like this:

Line #    Mem usage    Increment   Line Contents
================================================
     4      5.97 MB      0.00 MB   def moo():
     5     13.61 MB      7.64 MB       a = [1] * (10 ** 6)
     6    166.20 MB    152.59 MB       b = [2] * (2 * 10 ** 7)
     7     13.61 MB   -152.59 MB       del b
     8     13.61 MB      0.00 MB       return a
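
If memory_profiler is not available at all, a rough fallback (assuming Python 3.4 or newer) is the standard-library tracemalloc module. It does not give a line-by-line profile like %mprun, but it reports the current and peak memory allocated by Python code:

import tracemalloc

tracemalloc.start()                              # start tracing Python allocations
moo()                                            # run the function we want to inspect
current, peak = tracemalloc.get_traced_memory()  # values are in bytes
tracemalloc.stop()                               # stop tracing again

print('current: %.1f MB, peak: %.1f MB' % (current / 1e6, peak / 1e6))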

Exercise

We prepared an example program that reads a text corpus of one million words and calculates some simple statistics for this text: first it creates a list of unique words, and second it counts how often each of these words occurs. But first, some imports (NLTK is the Natural Language Toolkit)!


In [ ]:
import numpy as np
import nltk
# download Brown corpus if necessary
nltk.download('brown')
# make all words from corpus lower-case
brown = [word.lower() for word in nltk.corpus.brown.words()]

Run the following code and get yourself a cup of coffee...


In [ ]:
def print_statistics_slow(corpus):

    # data structures for counting words
    word_list = []
    counter_list = []

    # calculate statistics of words
    for word in corpus:
    
        if word in word_list:
            index = word_list.index(word)
            counter_list[index] = counter_list[index] + 1
        else:
            word_list.append(word)
            counter_list.append(1)
            
    # create a list of indices sorted according to word occurrences
    sorted_indices = np.argsort(counter_list)

    # only keep last ten entries of the index list
    sorted_indices = sorted_indices[-10:]

    # switch order of list
    sorted_indices = sorted_indices[::-1]

    print('most common words:')
    for index in sorted_indices:
        print('%s (%d)' % (word_list[index], counter_list[index]))
    
    print('\nOverall there are %d unique words in the corpus.' % len(word_list))
    
# statistics for whole corpus (may take 15 minutes)
print_statistics_slow(brown)

# statistics for first 100,000 words only
#print_statistics_slow(brown[:100000])

Okay, nice. But why did the script become so slooow?! Use the line_profiler to find out.


In [ ]:
# profiling for whole corpus (15 minutes again)
%lprun -f print_statistics_slow print_statistics_slow(brown)

# profiling for first 100,000 words only
#%lprun -f print_statistics_slow print_statistics_slow(brown[:100000])

It turns out that we have used the data structures in a very naive and inefficient way. Rewrite the method so that it runs faster; line_profiler may help you to gradually remove the bottlenecks. In the end the method shouldn't need more than two or three seconds.

Hint: Think about using the data structures dict and set instead of lists where appropriate. A set allows much faster membership testing than a list, and with a dict you can quickly map from a key to a value, for instance word -> index.

This article about sorting may also be useful: https://wiki.python.org/moin/HowTo/Sorting
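
As a small illustration of the hint (this is not the full solution; the variable names are made up for this toy example), a dict can count words without any list searching, and the counts can then be sorted by value:


In [ ]:
toy_corpus = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the']

counts = {}
for word in toy_corpus:
    # dict lookup is fast, so counting does not slow down as the dict grows
    counts[word] = counts.get(word, 0) + 1

# sort the (word, count) pairs by count, largest first
print(sorted(counts.items(), key=lambda item: item[1], reverse=True))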


In [ ]:
def print_statistics_fast(corpus):
    pass  # your code here

%lprun -f print_statistics_fast print_statistics_fast(brown)

Our solution:


In [ ]:
from solutions import *
decrypt_solution(solution_profiling, passphrase='foo')